Clustering algorithms are designed to answer a fairly straightforward problem: given a set of data points, what is the best way to split them into groups?
Now “best” is a subjective qualifier that is highly dependent on the context, so clustering algorithms work around this by assuming what’s best. There’s nothing wrong with making assumptions to allow for better answers given those assumptions, but we must also be sure not to forget about those assumptions because they might not hold in the context of our implementation.
In this report, I will first discuss the K-means algorithm, which is both fairly straightforward and commonly used. I will then show some simple toy examples that “break” K-means. This will motivate the use of Spectral Clustering, a more mathematically complicated algorithm that can handle these toy examples.
The goal of K-means is to split up a data set into \(K\) groups specifically based on minimizing the sum of squared distances between each point and the center of its group.
Let’s put this in more mathematical notation:
Suppose we have a set of points \(S \subseteq \mathbb{R}^D\) and let \(k \in \mathbb{Z}^+\)
Define \(C_1, C_2, \ldots, C_k\) to be a partition of \(S\) when: \[C_1 \cup C_2 \cup \cdots \cup C_k = S \text{ and } C_i \cap C_j = \emptyset \hspace{2ex} (i \neq j)\]
Next, define the center \(\mu_j\) of each \(C_j\) to be the mean of the points in \(C_j\), giving centers \(\{\mu_1, \mu_2, \ldots, \mu_k\}\)
We are trying to find the partition that minimizes the sum of squared distances between each point and the center of its assigned cluster \(C_j\), or, in other words, we are minimizing the following cost of this partition: \[cost(C_{1:k}, \mu_{1:k}) = \sum_{j = 1}^k \sum_{x \in C_j} || x - \mu_j ||^2\]
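As a quick sanity check on this definition, the cost of a candidate partition can be computed directly. This is a minimal NumPy sketch; the `kmeans_cost` name and the toy points are made up for illustration.

```python
import numpy as np

def kmeans_cost(points, labels, centers):
    """Sum of squared distances from each point to its assigned cluster center."""
    return sum(np.sum((points[labels == j] - mu) ** 2)
               for j, mu in enumerate(centers))

# Tiny made-up example: two clusters on the real line (D = 1).
points = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[0.5], [10.5]])  # the mean of each cluster
print(kmeans_cost(points, labels, centers))  # -> 1.0
```

Each cluster contributes \(0.25 + 0.25\) here, giving a total cost of \(1\).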
Before discussing the K-means algorithm, we need one more definition.
The Voronoi Region corresponding to a point \(\mu_i \in \mathbb{R}^D\) is all the points closest to \(\mu_i\), or: \[V(\mu_i) = \{ y \in \mathbb{R}^D : ||y - \mu_i|| \leqslant ||y - \mu_j|| \hspace{2ex} \forall j \neq i \}\]
The K-means algorithm is based on two important ideas.
First, suppose we have some fixed partition of \(S\) and want to pick the best center points for minimizing cost. Clearly the best points for minimizing the sum of squared distances within each \(C_j\) are the means \(\mu_j\).
Second, suppose we have \(K\) fixed points \(z_1, \ldots, z_K\) and want to find the best partition of \(S\) into \(K\) parts. Then the best partition puts the points in the Voronoi region of each \(z_i\) into the \(i^{th}\) component of the partition, or: \[C_i = V(z_i) \cap S\]
As a simple exercise in thinking about why this second claim is true, suppose we start with points assigned based on Voronoi regions, and then we take a point \(x\) in the Voronoi region of \(z_i\) and move it to \(C_j \hspace{1ex} (j \neq i)\)
By the definition of the Voronoi region, \(x\) is at least as far from \(z_j\) as it is from \(z_i\), so the move cannot decrease our cost.
Now, we are ready to discuss an actual algorithm. It is important to note that K-means is not normally run to find the globally cost-minimizing partition, but rather finds a local minimum of the cost.
A common way K-means is initialized is simply with \(K\) random seed points (with \(K\) specified by the user).
Then the algorithm alternates between two actions: assigning each point to the Voronoi region of its nearest center, and recomputing each center as the mean of its cluster: \[C_j \leftarrow V(z_j) \cap S \hspace{4ex} z_j \leftarrow \frac{1}{|C_j|} \sum_{x \in C_j} x\]
This sequence repeats as long as cost is lowered by more than some \(\epsilon\) threshold between steps.
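The loop just described can be sketched in Python. This is a bare-bones version of the standard alternating procedure (Lloyd's algorithm); the function name and NumPy structure are my own, not any particular library's API.

```python
import numpy as np

def kmeans(S, K, n_iter=100, tol=1e-6, seed=0):
    """Alternate Voronoi assignment and mean updates until cost stops improving."""
    rng = np.random.default_rng(seed)
    # Initialize with K random seed points drawn from the data.
    centers = S[rng.choice(len(S), size=K, replace=False)]
    prev_cost = np.inf
    for _ in range(n_iter):
        # Assignment step: each point joins the Voronoi region of its nearest center.
        dists = np.linalg.norm(S[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its cluster.
        centers = np.array([S[labels == j].mean(axis=0) for j in range(K)])
        cost = sum(np.sum((S[labels == j] - centers[j]) ** 2) for j in range(K))
        if prev_cost - cost < tol:  # stop once cost improves by less than epsilon
            break
        prev_cost = cost
    return labels, centers
```

Note this sketch does nothing clever if a cluster ever empties out, and, as discussed above, it only finds a local minimum that depends on the random seed points.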
There are certainly situations where K-means works well. Take the following data for example:
Given what we’ve discussed, it looks like K-means should work well here.
As expected, K-means succeeds here.
There is a fundamental assumption in K-means, though, that this example conceals: K-means “thinks” clusters of data radiate around some center point. Although this seems pretty natural, it is actually a strong assumption.
So let’s break that assumption with some well-known simple examples:
In both of these examples, there are two “natural” clusters we’d like to recover, but as we can see, K-means performs poorly:
This should not be surprising, though. For both of these, there is clearly no way to place two center points on the plane that would perform better. As nice as the logic of using Voronoi regions to partition is, it unfortunately doesn’t generalize to other kinds of clusters. Spectral Clustering, however, can handle these examples well.
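For concreteness, here is one way such data can be generated. The `two_rings` helper and its radii are made up for illustration; it produces the classic concentric-rings example where clusters do not radiate around center points.

```python
import numpy as np

def two_rings(n=200, r_inner=1.0, r_outer=5.0, seed=0):
    """Two concentric rings: clusters that do not radiate around center points."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0, 2 * np.pi, size=n)
    radius = np.where(np.arange(n) < n // 2, r_inner, r_outer)
    points = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])
    true_labels = (np.arange(n) >= n // 2).astype(int)
    return points, true_labels
```

With \(K = 2\), any pair of centers splits the plane along a single line (their shared Voronoi boundary), and no single line separates the inner ring from the outer one.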
I’m going to first focus on the math with respect to breaking data into two clusters, but I will generalize later. I will only be going through unnormalized spectral clustering. For more information on other spectral clustering methods, see von Luxburg (2007).
To explain this in detail, we first need a lot of terminology.
Let \(G = (V, E, W)\) be an undirected, weighted graph, with \(n\) vertices (i.e. the cardinality of \(V\), \(|V|\), is \(n\))
Define the weight matrix \(W_G\) of \(G\) to be an \(n \times n\) matrix where \(w_{ij}\) represents the weight between vertex \(v_i\) and vertex \(v_j\)
We will make three assumptions about these weights: they are symmetric (\(w_{ij} = w_{ji}\)), they are non-negative (\(w_{ij} \geqslant 0\)), and there are no self-loops (\(w_{ii} = 0\))
Given some \(v_i \in V\), define the weighted degree of \(v_i\) as the sum of weights connected directly to \(v_i\): \[d(v_i) = d_i = \sum_{j = 1}^n w_{ij}\]
Define the degree matrix \(D_G\) of \(G\) to be a diagonal matrix where the \(i^{th}\) term on the diagonal is \(d_i\):
\[D_G =
\begin{bmatrix}
d_1 & & & \\
& d_2 & & \\
& & \ddots & \\
& & & d_n
\end{bmatrix}\]
Define the Laplacian matrix \(L_G\) of \(G\) to be: \[L_G = D_G - W_G\]
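As a small concrete example of these definitions, here is \(L_G = D_G - W_G\) for a made-up three-vertex weighted path, checked with NumPy.

```python
import numpy as np

# A tiny weighted path graph on 3 vertices with made-up weights:
# v1 --(1)-- v2 --(2)-- v3
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 2.0],
              [0.0, 2.0, 0.0]])
D = np.diag(W.sum(axis=1))  # weighted degrees d_i on the diagonal
L = D - W                   # the (unnormalized) graph Laplacian

# Every row of L sums to zero, so L sends the all-ones vector to zero.
print(L @ np.ones(3))  # -> [0. 0. 0.]
```

This row-sums-to-zero property is exactly why indicator vectors of connected components will turn out to be eigenvectors with eigenvalue \(0\).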
Suppose \(A, B \subseteq V\)
Define \(W(A, B)\) to be the sum of weights connecting \(A\) to \(B\): \[W(A, B) = \sum_{i \in A, j \in B} w_{ij}\]
Given \(A \subseteq V\), define the cut of \(A\) to be: \[cut(A) = W(A, \bar{A})\]
where \(\bar{A}\) is the elements in \(V\) that are not contained in \(A\)
Let’s pause from this barrage of definitions for a moment. Remember we’re thinking about how to “best” break a graph into \(K\) clusters.
One strategy is to minimize the total weight of the edges we cut through when separating the graph. Perhaps we could try finding the minimum \(cut(A)\): \[\underset{A \subseteq V}{argmin} \hspace{1ex} cut(A)\]
It turns out this is easy to solve computationally, but unfortunately for us, in practice, this tends to simply separate outliers from the rest of the graph.
What if we instead try to make \(W(A, \bar{A})\) small and keep the cardinality of the two components balanced?
Define the RatioCut of \(A\) to be: \[RatioCut(A) = cut(A) \cdot \bigg( \frac{1}{|A|} + \frac{1}{|\bar{A}|} \bigg) = cut(A) \cdot \bigg( \frac{1}{|A|} + \frac{1}{n - |A|} \bigg)\]
Note that \(\bigg( \frac{1}{|A|} + \frac{1}{|\bar{A}|} \bigg)\) achieves a minimum when \(|A| \approx \frac{n}{2}\)
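This balance claim is easy to check numerically; here is a quick made-up example with \(n = 10\).

```python
import numpy as np

n = 10
a = np.arange(1, n)               # possible sizes |A| = 1, ..., n - 1
balance = 1 / a + 1 / (n - a)     # the balancing factor in RatioCut
print(a[balance.argmin()])        # -> 5, i.e. |A| = n / 2
```

So the factor rewards cuts that split the vertices into two groups of similar size.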
So what if we instead take the \(\underset{A \subseteq V}{argmin} RatioCut(A)\) ?
Now we’ve got something that sounds more reasonable, but unfortunately for us, this problem is NP-hard.
It turns out though that we can work around this issue. With a slight relaxation of this problem, we’ll be able to take advantage of the eigenvalues and eigenvectors of the Laplacian to help us, but in order to show that, we need a few more definitions and then a lot of proofs.
Graphs are abstract objects that don’t have “locations” in the Euclidean sense. We will thus work with functions that map vertices to real numbers, \(f: V \mapsto \mathbb{R}\), which we can identify with vectors:
\[\vec{f} = \begin{bmatrix} f(v_1) \\ f(v_2) \\ \vdots \\ f(v_n) \\ \end{bmatrix} = \begin{bmatrix} f_1 \\ f_2 \\ \vdots \\ f_n \\ \end{bmatrix}\]
Let \(A \subseteq V\). Define the indicator function: \[ \boldsymbol{1}_A(i) = \begin{cases} 1 & v_i \in A \\ 0 & \text{otherwise} \end{cases}\]
Finally, since the Laplacian matrix \(L_G\) is symmetric with only real values, by the Spectral Theorem, the eigenvalues \(\lambda_1, \lambda_2, \ldots, \lambda_n\) are all real; the first claim below will also show they are all greater than or equal to \(0\). We will index them in increasing order: \[ 0 \leqslant \lambda_1 \leqslant \lambda_2 \leqslant \cdots \leqslant \lambda_n\]
Claim
\(f^T L f = \frac{1}{2} \sum_{i, j} w_{ij}(f_i - f_j)^2\)
Proof
\[\begin{aligned}
&L =
\begin{bmatrix}
\sum_j w_{1j} & -w_{12} & -w_{13} & \cdots & -w_{1n} \\
-w_{21} & \sum_j w_{2j} & -w_{23} & \cdots & -w_{2n} \\
\vdots & & \ddots & & \vdots \\
\vdots & & & \ddots & -w_{(n-1)n} \\
-w_{n1} & \cdots & \cdots & & \sum_j w_{nj} \\
\end{bmatrix} \\ \\
&Lf = L \begin{bmatrix}
f_1 \\
\vdots \\
f_n
\end{bmatrix}
= \begin{bmatrix}
f_1 \sum_j w_{1j} - \sum_{j \neq 1} f_j w_{1j} \\
\vdots \\
f_n \sum_j w_{nj} - \sum_{j \neq n} f_j w_{nj} \\
\end{bmatrix} \\ \\
&\text{Since } w_{ii} = 0 \text{, we can rewrite this as} \\ \\
&Lf =
\begin{bmatrix}
f_1 \sum_j w_{1j} - \sum_j f_j w_{1j} \\
\vdots \\
f_n \sum_j w_{nj} - \sum_j f_j w_{nj} \\
\end{bmatrix} \\ \\
&f^TLf = \begin{bmatrix}
f_1 & \cdots & f_n
\end{bmatrix}
Lf \\ \\
&= f_1^2 \sum_j w_{1j} - f_1 \sum_j f_j w_{1j} +
f_2^2 \sum_j w_{2j} - f_2 \sum_j f_j w_{2j} + \cdots +
f_n^2 \sum_j w_{nj} - f_n \sum_j f_j w_{nj} \\ \\
&= \bigg[ f_1^2 \sum_j w_{1j} + \cdots + f_n^2 \sum_j w_{nj} \bigg]
+ \bigg[- f_1 \sum_j f_j w_{1j} - \cdots - f_n \sum_j f_j w_{nj} \bigg] \\ \\
&= \bigg[ \sum_{i, j} f_i^2 w_{ij} \bigg]
- \bigg[ \sum_{i, j} f_i f_j w_{ij} \bigg] \\ \\
&= \bigg[ \frac{1}{2} \sum_{i, j} f_i^2 w_{ij} +
\frac{1}{2} \sum_{i, j} f_j^2 w_{ij} \bigg] +
\frac{1}{2}\bigg[ -2 \sum_{i, j} f_i f_j w_{ij} \bigg] \\ \\
&= \frac{1}{2} \bigg[ \sum_{i, j} f_i^2 w_{ij}
-2 \sum_{i, j} f_i f_j w_{ij}
+ \sum_{i, j} f_j^2 w_{ij} \bigg] \\ \\
&= \frac{1}{2} \bigg[ \sum_{i, j} f_i^2 w_{ij}
-2 f_i f_j w_{ij}
+ f_j^2 w_{ij} \bigg] \\ \\
&= \frac{1}{2} \sum_{i, j} w_{ij}(f_i - f_j)^2 \hspace{4ex} _\blacksquare
\end{aligned}\]
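The identity just proved is easy to verify numerically for a random graph satisfying our weight assumptions; the quick check below is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
# Random symmetric weight matrix with zero diagonal (our weight assumptions).
W = rng.uniform(0, 1, size=(n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W

f = rng.normal(size=n)  # an arbitrary vector f
lhs = f @ L @ f
rhs = 0.5 * sum(W[i, j] * (f[i] - f[j]) ** 2 for i in range(n) for j in range(n))
assert np.isclose(lhs, rhs)
```

Note the identity also shows \(f^T L f \geqslant 0\) for every \(f\), i.e. the Laplacian is positive semi-definite.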
Before this next proof, we need to recall a few things from linear algebra.
Suppose \(M\) is a square matrix and \(\lambda\) is an eigenvalue of \(M\)
Define \(E_\lambda = \{ \vec{v} | M \vec{v} = \lambda \vec{v} \}\)
\(dim(E_\lambda)\) is the geometric multiplicity of \(\lambda\)
In other words, the geometric multiplicity of \(\lambda\) is the dimension of the span of eigenvectors corresponding to the eigenvalue \(\lambda\)
Also recall that the algebraic multiplicity of \(\lambda\) is simply the number of times the eigenvalue \(\lambda\) appears in the spectrum of \(M\)
Because our Laplacian matrix \(L_G\) is diagonalizable (again applying the Spectral Theorem here), it turns out that the algebraic multiplicity is equal to the geometric multiplicity, so from now on, we will simply use the term multiplicity.
With that, on to the next proof.
Claim
The multiplicity of \(0\) as an eigenvalue of \(L_G\) equals the number of connected components of \(G\)
Proof
Suppose \(G\) has exactly \(k\) connected components \(A_1, A_2, \ldots, A_k\) where \[ A_1 \cup A_2 \cup \cdots \cup A_k = V \hspace{2ex} \text{ and } \hspace{2ex} \forall i \neq j \hspace{1ex} A_i \cap A_j = \emptyset\]
(i.e. no edge connects \(A_i\) and \(A_j\) when \(i \neq j\))
Let’s re-order the vertices such that the vertices in \(A_1\) come first, then the vertices in \(A_2\), etc. The resulting matrix looks something like:
\[L_G = \begin{bmatrix} [A_1 \text{ non-zero weights}] & & & 0 \\ & [A_2 \text{ non-zero weights}] & & \\ & & \ddots & \\ 0 & & & [A_k \text{ non-zero weights}] \end{bmatrix}\]
If you are not convinced of this, without loss of generality, let’s look at the first block row. We know all of the non-zero weights involving vertices in \(A_1\) connect them to other vertices in \(A_1\) (remember we assumed \(A_1\) to be a connected component). In particular, no edge connects \(A_1\) to any \(A_j\) \((j \neq 1)\), so all the entries outside the first diagonal block are \(0\).
Thus, \(\forall i \in \{ 1, 2, \ldots, k \}\), \(L_G \boldsymbol{1}_{A_i} = \vec{0}\), so the \(\boldsymbol{1}_{A_i}\) are linearly independent eigenvectors of \(L_G\), all with the same eigenvalue \(\lambda = 0\)
Therefore, \(\boldsymbol{1}_{A_1}, \boldsymbol{1}_{A_2}, \ldots, \boldsymbol{1}_{A_k}\) are all in \(E_0\), which means \(dim(E_0) \geqslant k\)
Now, we will show \(dim(E_0) \leqslant k\)
Suppose \(\vec{f} \in E_0\)
Thus, \(L\vec{f} = \vec{0}\), which means \(\vec{f}^T L\vec{f} = 0\)
By the previous theorem we proved, this means: \[0 = \vec{f}^T L\vec{f} = \frac{1}{2} \sum_{i, j} w_{ij}(f_i - f_j)^2\]
Just as before, the weights between vertices in different connected components are \(0\) by assumption, while within a connected component, any two vertices are joined by a path of edges with positive weights.
Since \((f_i - f_j)^2 \geqslant 0\) \(\forall i, j\), the only way for this sum to be zero is \(f_i = f_j\) whenever \(w_{ij} > 0\). Following paths of such edges, \(f\) must take the same value on any two vertices in the same connected component.
Therefore, the value of \(\vec{f}\) on each \(A_i\) is some constant \(c_i\)
which means we can express \(\vec{f}\) as: \[\vec{f} = c_1 \cdot \boldsymbol{1}_{A_1} + c_2 \cdot \boldsymbol{1}_{A_2} + \cdots + c_k \cdot \boldsymbol{1}_{A_k}\]
Thus, \(\vec{f} \in span \{ \boldsymbol{1}_{A_1}, \boldsymbol{1}_{A_2}, \ldots, \boldsymbol{1}_{A_k} \}\)
Since \(\vec{f}\) was arbitrary, \(dim(E_0) \leqslant k\)
Therefore, \(dim(E_0) = k\), which means \(0\) has multiplicity \(k\) in \(L_G \hspace{4ex} _\blacksquare\)
So now we can write the eigenvalues of \(L_G\) as:
\[ 0 = \lambda_1 = \cdots = \lambda_k < \lambda_{k + 1} \leqslant \cdots \leqslant \lambda_n\] where \(k\) is the number of connected components of \(G\)
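This claim can also be checked numerically on a small made-up graph with two connected components:

```python
import numpy as np

# Two disjoint triangles with unit weights: exactly 2 connected components.
triangle = np.ones((3, 3)) - np.eye(3)
W = np.block([[triangle, np.zeros((3, 3))],
              [np.zeros((3, 3)), triangle]])
L = np.diag(W.sum(axis=1)) - W

eigvals = np.linalg.eigvalsh(L)  # sorted in increasing order
# The eigenvalue 0 appears exactly twice -- once per connected component.
print(np.sum(np.isclose(eigvals, 0)))  # -> 2
```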
Also, define the Fiedler Value of \(L_G\) to be its smallest non-zero eigenvalue (so for a connected graph, the Fiedler value is \(\lambda_2\)).
For our next proof we need one more definition.
Define \(f_A\) via:
\[(f_A)_i = \begin{cases} \sqrt{\frac{|\bar{A}|}{|A|}} & v_i \in A \\ -\sqrt{\frac{|A|}{|\bar{A}|}} & v_i \in \bar{A} \end{cases}\]
Claim
\(f_A^T L f_A = n \cdot RatioCut(A)\)
Proof
We just proved that \(f^T L f = \frac{1}{2} \sum_{i, j} w_{ij}(f_i -f_j)^2\)
Therefore, \(f_A^T L f_A = \frac{1}{2} \sum_{i, j} w_{ij}(f_A(i) - f_A(j))^2\)
By construction of \(f_A\), whenever \(i, j \in A\) or \(i, j \in \bar{A}\), \(f_A(i) - f_A(j) = 0\)
Since the full sum runs over ordered pairs and counts each cross pair twice (once as \((i, j)\) and once as \((j, i)\)), we can write \(f_A^T L f_A\) as:
\[\begin{aligned} f_A^T L f_A &= \frac{1}{2} \cdot 2 \sum_{i \in A, j \in \bar{A}} w_{ij} (f_A(i) - f_A(j))^2 \\ \\ &= \sum_{i \in A, j \in \bar{A}} w_{ij} \bigg( \sqrt{\frac{|\bar{A}|}{|A|}} + \sqrt{\frac{|A|}{|\bar{A}|}}\bigg)^2 \\ \\ &= \sum_{i \in A, j \in \bar{A}} w_{ij} \bigg(\frac{|\bar{A}|}{|A|} + 2 \cdot \sqrt{\frac{|\bar{A}|}{|A|} \cdot \frac{|A|}{|\bar{A}|}} + \frac{|A|}{|\bar{A}|}\bigg) \\ \\ &= \sum_{i \in A, j \in \bar{A}} w_{ij} \bigg(\frac{|\bar{A}|}{|A|} + 2 + \frac{|A|}{|\bar{A}|}\bigg) \\ \\ &= \sum_{i \in A, j \in \bar{A}} w_{ij} \bigg(\frac{n - |A|}{|A|} + 1 + \frac{n - |\bar{A}|}{|\bar{A}|} + 1\bigg) \\ \\ &= \sum_{i \in A, j \in \bar{A}} w_{ij} \cdot n \bigg(\frac{1}{|A|} + \frac{1}{|\bar{A}|}\bigg) \\ \\ &= n \cdot cut(A) \bigg(\frac{1}{|A|} + \frac{1}{|\bar{A}|}\bigg) \\ \\ &= n \cdot RatioCut(A) \\ \\ \end{aligned}\]
Thus, \(RatioCut(A) = \frac{1}{n} f_A^T L f_A \hspace{4ex} _\blacksquare\)
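The relationship between \(f_A\) and \(RatioCut\) can be verified numerically. The code below checks the identity \(f_A^T L f_A = n \cdot RatioCut(A)\) on a random weight matrix and an arbitrary subset \(A\); the setup is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
W = rng.uniform(0, 1, size=(n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W

in_A = np.arange(n) < 3                    # an arbitrary subset A of size 3
a, b = in_A.sum(), n - in_A.sum()          # |A| and the size of A's complement
f_A = np.where(in_A, np.sqrt(b / a), -np.sqrt(a / b))

cut = W[np.ix_(in_A, ~in_A)].sum()         # the total weight crossing the cut
ratio_cut = cut * (1 / a + 1 / b)
assert np.isclose(f_A @ L @ f_A, n * ratio_cut)
assert np.isclose(f_A @ np.ones(n), 0)     # f_A is orthogonal to the ones vector
```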
We just need one more simple proof before we can put this all together.
Claim \(f_A \bullet \boldsymbol{1}_n = 0\)
Proof
\[\begin{aligned} f_A \bullet \boldsymbol{1}_n &= |A| \sqrt{\frac{|\bar{A}|}{|A|}} - |\bar{A}| \sqrt{\frac{|A|}{|\bar{A}|}} \\ \\ &= \sqrt{\frac{|\bar{A}| \cdot |A|^2}{|A|}} - \sqrt{\frac{|A| \cdot |\bar{A}|^2}{|\bar{A}|}} \\ \\ &= \sqrt{|\bar{A}| \cdot |A|} - \sqrt{|\bar{A}| \cdot |A|} \\ \\ &= 0 \hspace{4ex} _\blacksquare \end{aligned}\]
Our goal right now is to divide a data set into two clusters while keeping the clusters balanced in size, which gave us \(RatioCut\)
We showed that \(f_A^T L f_A = n \cdot RatioCut(A)\) where \[(f_A)_i = \begin{cases} \sqrt{\frac{|\bar{A}|}{|A|}} & v_i \in A \\ -\sqrt{\frac{|A|}{|\bar{A}|}} & v_i \in \bar{A} \end{cases}\]
We also know \(f_A\) is orthogonal to the all ones vector \(\boldsymbol{1}_V\)
With this, we can rephrase the \(RatioCut\) Problem as a constrained optimization problem.
We want to find \[\begin{aligned} &\underset{A \subseteq V}{argmin} \hspace{1ex} RatioCut(A) \\ &= \underset{A \subseteq V}{argmin} \hspace{1ex} \frac{1}{n} f_A^T L f_A \\ &= \underset{A \subseteq V}{argmin} \hspace{1ex} f_A^T L f_A \end{aligned}\]
subject to the constraint \(f_A \bullet \boldsymbol{1}_V = 0\)
This problem is NP-hard, but we can relax it by letting \(f\) range over all of \(\mathbb{R}^n\): \[\underset{f \in \mathbb{R}^n}{argmin} \hspace{1ex} f^T L f\]
subject to the constraints \(f \bullet \boldsymbol{1}_V = 0\) and \(||f|| = \sqrt{n}\) (note \(||f_A|| = \sqrt{n}\) by construction)
Conveniently for us, by the Rayleigh-Ritz Theorem, this \(argmin\) is the eigenvector corresponding to the Fiedler value. See the Appendix for more discussion and proof of the Rayleigh-Ritz Theorem.
To turn this into an assignment of points into two clusters, remember our original construction of \(f_A\). In particular, recall that \(f_A\) maps vertices in \(A\) to positive values, and \(f_A\) maps vertices in \(\bar{A}\) to negative values.
At this point, we could actually assign clusters based on whether \(f\) is positive or negative, but to be more cautious, we can use K-means with \(k = 2\).
It’s important to note here that we are hoping that the answer from the relaxed problem is close to the true answer. This certainly is not always the case, as discussed in Section 5.4 of von Luxburg (2007).
First, we must address how to create the weights between data points when we build the graph. I’ve implemented the unnormalized spectral clustering as described in von Luxburg (2007). I used the Gaussian similarity function to create the weights:
\[s(x_i, x_j) = \exp\bigg(- \frac{||x_i - x_j||^2}{2\sigma^2} \bigg)\]
This results in points that are closer together having a larger similarity, with the similarity decaying exponentially as the Euclidean distance between points increases.
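A minimal implementation of this similarity function might look like the following; the function name is my own.

```python
import numpy as np

def gaussian_similarity(X, sigma=1.0):
    """Pairwise Gaussian similarities; nearby points get weights close to 1."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    W = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # no self-loops, matching our weight assumptions
    return W
```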
It’s important to note that the choice of \(\sigma\) substantially affects the performance of Spectral Clustering. If the Gaussian similarity function decays too quickly or too slowly, then the \(RatioCut\) corresponding to the “natural” clusters might not look all that different from other cuts. K-means has no analogous parameter to tune, but with proper calibration of \(\sigma\), Spectral Clustering can outperform it, as I will demonstrate with several simple examples.
The code is up on my Github page. The basic structure of my algorithm for assigning \(n\) data points into \(K\) clusters is as follows:
Build the \(n \times n\) weight matrix using the Gaussian similarity function
Construct the \(n \times n\) Laplacian matrix and find its eigenvalues and eigenvectors
Keep the \(K\) eigenvectors (\(n\)-dimensional) corresponding to the \(K\) smallest eigenvalues (note: because I’m using the Gaussian similarity function, the resulting graph is fully connected, so only one eigenvalue will be equal to 0, or approximately 0 due to floating-point error)
Cluster the \(n\) rows of this resulting matrix using K-means into \(K\) clusters
Output an \(n\)-dimensional numeric vector with values between 1 and \(K\), representing each point’s cluster assignment
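The steps above can be sketched end to end as follows. This is a minimal NumPy-only sketch of the same pipeline, not the exact code from my Github; it uses a bare-bones deterministic K-means in place of a library call and 0-based cluster labels.

```python
import numpy as np

def spectral_clustering(X, K, sigma=1.0):
    """Unnormalized spectral clustering, following the steps listed above."""
    n = len(X)
    # 1. Weight matrix from the Gaussian similarity function.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    W = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # 2. Unnormalized Laplacian and its eigendecomposition.
    L = np.diag(W.sum(axis=1)) - W
    _, eigvecs = np.linalg.eigh(L)  # eigh returns eigenvalues in increasing order
    # 3. Keep the K eigenvectors of the K smallest eigenvalues as columns.
    U = eigvecs[:, :K]
    # 4. Cluster the n rows of U with a bare-bones K-means.
    #    Farthest-point initialization keeps this sketch deterministic.
    centers = [U[0]]
    for _ in range(1, K):
        d = np.min([np.linalg.norm(U - c, axis=1) for c in centers], axis=0)
        centers.append(U[d.argmax()])
    centers = np.array(centers)
    for _ in range(100):
        labels = np.linalg.norm(U[:, None] - centers[None], axis=2).argmin(axis=1)
        new_centers = np.array([U[labels == j].mean(axis=0) for j in range(K)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # 5. Each point's cluster assignment (0 to K - 1 in this sketch).
    return labels
```

On concentric-rings data of the kind that breaks K-means, this pipeline (with a suitably small \(\sigma\)) recovers the two rings.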
The results look much better than K-means for our more complicated examples:
It also performs just as well for the examples that play nice with K-means:
As one more example of Spectral Clustering’s “respect” for the structure of data when partitioning it, I include the Swiss Roll as a 3-dimensional example. You’ll never guess how they came up with the name for this data set.
Note how the resulting clusters from K-means fail to stay within one layer, which is to be expected when clustering based on Euclidean distance from a central point, while Spectral Clustering maintains this within-layer integrity.
Here, I will go through the set-up and proof of the Rayleigh-Ritz Theorem.
Suppose \(M\) is a real and symmetric \(D \times D\) matrix.
Let \(\vec{x} \in \mathbb{R}^D\)
Define the Rayleigh Quotient \(R_M: \mathbb{R}^D \mapsto \mathbb{R}\) via: \[R_M(\vec{x}) = \frac{\vec{x}^T M \vec{x}}{\vec{x}^T \vec{x}}\]
Thm (Rayleigh-Ritz)
Suppose \(M\) is real and symmetric with orthonormal eigenvectors \(\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_D\) and corresponding eigenvalues \(\lambda_1 \geqslant \lambda_2 \geqslant \cdots \geqslant \lambda_D\)
Then the maximum value of \(R_M\) is \(\lambda_1\), attained at \(\vec{v}_1\)
Furthermore, the minimum value of \(R_M\) is \(\lambda_D\), attained at \(\vec{v}_D\)
Proof
Let \(\vec{u} \in \mathbb{R}^D\) where \(||\vec{u}|| = 1\) be given. (Since \(R_M(c \vec{u}) = R_M(\vec{u})\) for any scalar \(c \neq 0\), it suffices to consider unit vectors.)
Since \(M\) is symmetric and real, by the Spectral Theorem, there exists unique \(c_1, c_2, \ldots, c_D \in \mathbb{R}\) such that \(\vec{u} = c_1 \vec{v}_1 + c_2 \vec{v}_2 + \cdots + c_D \vec{v}_D\)
(Note \(c_1^2 + c_2^2 + \cdots + c_D^2 = 1\))
We can express \(R_M(\vec{u})\) as: \[\begin{aligned} R_M(\vec{u}) = \vec{u}^T M \vec{u} &= (c_1 \vec{v}_1^T + \cdots + c_D \vec{v}_D^T) M (c_1 \vec{v}_1 + \cdots + c_D \vec{v}_D) \\ &= (c_1 \vec{v}_1^T + \cdots + c_D \vec{v}_D^T) (\lambda_1 \cdot c_1 \vec{v}_1 + \cdots + \lambda_D \cdot c_D \vec{v}_D) \\ \end{aligned}\]
By the Spectral Theorem, \(\forall i \neq j, \vec{v}_i \bullet \vec{v}_j = 0\), and clearly \(\forall i, \vec{v}_i \bullet \vec{v}_i = 1\), giving us: \[R_M(\vec{u}) = \lambda_1 c_1^2 + \cdots + \lambda_D c_D^2\]
Thus, we maximize \(R_M\) by setting \(c_1 = 1\) and \(c_2 = c_3 = \cdots = c_D = 0\), giving us a value of \(\lambda_1\) at \(\vec{u} = \vec{v}_1\)
and we minimize \(R_M\) by setting \(c_D = 1\) and \(c_1 = c_2 = \cdots = c_{D-1} = 0\), giving us a value of \(\lambda_D\) at \(\vec{u} = \vec{v}_D \hspace{4ex} _\blacksquare\)
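The theorem is easy to check numerically on a random symmetric matrix; the check below is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
M = (M + M.T) / 2                     # a random real symmetric matrix
eigvals, eigvecs = np.linalg.eigh(M)  # increasing eigenvalue order

def rayleigh(M, x):
    return (x @ M @ x) / (x @ x)

# The extreme values of the Rayleigh quotient are the extreme eigenvalues,
# attained at the corresponding eigenvectors...
assert np.isclose(rayleigh(M, eigvecs[:, -1]), eigvals[-1])
assert np.isclose(rayleigh(M, eigvecs[:, 0]), eigvals[0])
# ...and at any other vector the quotient lies in between.
x = rng.normal(size=4)
assert eigvals[0] <= rayleigh(M, x) <= eigvals[-1]
```

Note `eigh` sorts eigenvalues in increasing order, the reverse of the decreasing convention used in the statement above.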
I was taught this material in Paul Bendich’s High-Dimensional Data Analysis Class (Math 465) at Duke in the Fall of 2017, so I’d like to thank Paul for giving a very clear explanation on a very complicated topic.
I also referenced von Luxburg (2007) in writing this document.